Preface

R packages

In this practical, a number of R packages are used. If any of them are not installed you can still follow the practical but will not be able to run all of the code. The packages used (with versions that were used to generate the solutions) are:

  • mice (version: 3.3.0)
  • visdat (version: 0.5.2)
  • JointAI (version: 0.4.0)
  • VIM (version: 4.7.0)
  • plyr (version: 1.8.4)
  • corrplot (version: 0.84)

Help files

You can find help files for any function by adding a ? before the name of the function.

Alternatively, you can look up the help pages online at https://www.rdocumentation.org/ or find the whole manual for a package at https://cran.r-project.org/web/packages/available_packages_by_name.html

Dataset

Overview

For this example, we will use the NHANES dataset. To load this dataset, you can use the command file.choose() which opens the explorer and allows you to navigate to the location of the file NHANES_for_practicals.RData on your computer. If you know the path to the file, you can also use load("<path>/NHANES_for_practicals.RData").

Task

Let’s take a first look at the data. Useful functions are dim(), head(), str() and summary().

Solution

## [1] 1000   24
## 'data.frame':    1000 obs. of  24 variables:
##  $ occup   : Factor w/ 3 levels "working","looking for work",..: 1 1 1 1 3 1 1 1 1 1 ...
##  $ BMI     : num  27 29.5 29.8 17.5 31.5 ...
##  $ HDL     : num  1.06 1.29 0.85 1.99 NA 1.71 1.19 NA 1.71 1.32 ...
##  $ hypchol : Factor w/ 2 levels "no","yes": 1 2 1 1 NA 1 1 NA 1 1 ...
##  $ hypten  : Factor w/ 2 levels "no","yes": 2 2 1 1 2 1 1 2 1 1 ...
##  $ SBP     : num  NaN 128.7 114.7 96.7 132 ...
##  $ bili    : num  0.9 0.9 0.7 1 NA 0.7 0.5 NA 0.7 0.7 ...
##  $ educ    : Factor w/ 5 levels "Less than 9th grade",..: 3 5 5 5 2 1 2 4 4 3 ...
##  $ HyperMed: Ord.factor w/ 3 levels "no"<"previous"<..: 3 NA NA NA 3 NA NA 3 NA NA ...
##  $ creat   : num  1.05 0.98 0.83 0.72 NA 0.75 0.73 NA 0.89 0.95 ...
##  $ cohort  : chr  "2011" "2011" "2011" "2011" ...
##  $ age     : num  58 57 56 31 79 60 37 68 23 20 ...
##  $ DBP     : num  NaN 94.7 72 65.3 52 ...
##  $ smoke   : Ord.factor w/ 3 levels "never"<"former"<..: 1 1 3 1 1 1 1 1 1 1 ...
##  $ wgt     : num  93 90.7 76.2 52.2 70.8 ...
##  $ alc     : Ord.factor w/ 5 levels "0"<"<=1"<"1-3"<..: 1 5 1 NA 2 NA NA NA 3 2 ...
##  $ WC      : num  99.9 111.6 90.1 76.9 100.2 ...
##  $ hgt     : num  1.85 1.75 1.6 1.73 1.5 ...
##  $ race    : Factor w/ 5 levels "Mexican American",..: 3 2 5 3 4 5 5 4 1 2 ...
##  $ chol    : num  4.76 6.83 2.79 5.09 NA 5.59 6.08 NA 4.03 3.31 ...
##  $ uricacid: num  6.8 5.9 6.1 3.4 NA 5.1 6.1 NA 7.3 3.5 ...
##  $ DM      : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ gender  : Factor w/ 2 levels "male","female": 1 1 1 2 2 2 2 2 1 1 ...
##  $ albu    : num  4.7 4 4.2 4.3 NA 4.1 3.7 NA 4.4 4.6 ...
##               occup          BMI             HDL        hypchol     hypten         SBP        
##  working         :564   Min.   :15.50   Min.   :0.360   no  :799   no  :717   Min.   : 81.33  
##  looking for work: 44   1st Qu.:22.86   1st Qu.:1.110   yes :113   yes :241   1st Qu.:107.67  
##  not working     :378   Median :26.32   Median :1.340   NA's: 88   NA's: 42   Median :117.33  
##  NA's            : 14   Mean   :27.34   Mean   :1.385                         Mean   :119.06  
##                         3rd Qu.:30.62   3rd Qu.:1.600                         3rd Qu.:127.33  
##                         Max.   :62.35   Max.   :4.030                         Max.   :202.00  
##                         NA's   :35      NA's   :88                            NA's   :61      
##       bili                          educ         HyperMed       creat           cohort         
##  Min.   :0.2000   Less than 9th grade : 90   no      : 25   Min.   :0.4300   Length:1000       
##  1st Qu.:0.6000   9-11th grade        :133   previous: 14   1st Qu.:0.6800   Class :character  
##  Median :0.7000   High school graduate:214   yes     :144   Median :0.8000   Mode  :character  
##  Mean   :0.7479   some college        :288   NA's    :817   Mean   :0.8387                     
##  3rd Qu.:0.9000   College or above    :275                  3rd Qu.:0.9500                     
##  Max.   :4.1000                                             Max.   :6.1800                     
##  NA's   :92                                                 NA's   :91                         
##       age            DBP             smoke          wgt           alc            WC        
##  Min.   :20.0   Min.   :  0.00   never  :624   Min.   : 38.56   0   :117   Min.   : 61.90  
##  1st Qu.:30.0   1st Qu.: 63.33   former :176   1st Qu.: 63.39   <=1 :293   1st Qu.: 83.10  
##  Median :42.0   Median : 70.00   current:199   Median : 75.30   1-3 :100   Median : 94.15  
##  Mean   :44.3   Mean   : 70.12   NA's   :  1   Mean   : 77.45   3-7 : 70   Mean   : 95.33  
##  3rd Qu.:58.0   3rd Qu.: 76.67                 3rd Qu.: 88.45   >7  :105   3rd Qu.:104.70  
##  Max.   :79.0   Max.   :110.00                 Max.   :163.29   NA's:315   Max.   :170.50  
##                 NA's   :61                     NA's   :16                  NA's   :60      
##       hgt                        race          chol          uricacid        DM         gender   
##  Min.   :1.397   Mexican American  :123   Min.   :2.070   Min.   : 1.800   no :861   male  :464  
##  1st Qu.:1.600   Other Hispanic    :119   1st Qu.:4.240   1st Qu.: 4.200   yes:139   female:536  
##  Median :1.676   Non-Hispanic White:346   Median :4.890   Median : 5.200                         
##  Mean   :1.681   Non-Hispanic Black:212   Mean   :4.985   Mean   : 5.303                         
##  3rd Qu.:1.753   other             :200   3rd Qu.:5.590   3rd Qu.: 6.300                         
##  Max.   :1.981                            Max.   :9.930   Max.   :11.200                         
##  NA's   :24                               NA's   :88      NA's   :91                             
##       albu     
##  Min.   :2.60  
##  1st Qu.:4.10  
##  Median :4.30  
##  Mean   :4.27  
##  3rd Qu.:4.50  
##  Max.   :5.20  
##  NA's   :91

Variable coding

It is important to check that all variables are coded correctly, i.e., have the correct class. Imputation software (e.g., the mice package) uses the class to automatically select imputation methods. When importing data from other software it can happen that factors become continuous variables or that ordered factors lose their ordering.

str() showed that in the NHANES data smoke and alc are correctly specified as ordinal variables, but educ is an unordered factor.

Task

Using levels(NHANES$educ) we can print the names of the categories of educ. Convert the unordered factor to an ordered factor, for example using as.ordered(). Afterwards, check if the conversion was successful.

Solution

## [1] "Less than 9th grade"  "9-11th grade"         "High school graduate" "some college"        
## [5] "College or above"
##  Ord.factor w/ 5 levels "Less than 9th grade"<..: 3 5 5 5 2 1 2 4 4 3 ...

Distribution of missing values

Missing data pattern

In the summary() we could already see that there are missing values in several variables. The missing data pattern can be obtained and visualized by several functions from different packages. Examples are

  • md.pattern() from package mice
  • md_pattern() from package JointAI (with argument patter = TRUE)
  • aggr() from package VIM
  • vis_dat() and vis_miss() from package visdat

Task

Explore the missing data pattern of the NHANES data.

Solution (i)

##     educ cohort age race DM gender smoke occup wgt hgt BMI hypten WC SBP DBP HDL hypchol chol creat
## 33     1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1     1
## 534    1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1     1
## 69     1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1     1
## 149    1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1     1
## 2      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1     0
## 1      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1     0
## 20     1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   0       0    0     0
## 26     1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   0       0    0     0
## 9      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   0       0    0     0
## 10     1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   0       0    0     0
## 11     1      1   1    1  1      1     1     1   1   1   1      1  1   0   0   1       1    1     1
## 2      1      1   1    1  1      1     1     1   1   1   1      1  1   0   0   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   1   1      1  1   0   0   0       0    0     0
## 9      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   1       1    1     1
## 8      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   1       1    1     1
## 10     1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   1       1    1     1
## 7      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   1       1    1     1
## 3      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   0       0    0     0
## 1      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   0       0    0     0
## 5      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   0       0    0     0
## 2      1      1   1    1  1      1     1     1   1   1   1      1  0   0   0   1       1    1     1
## 3      1      1   1    1  1      1     1     1   1   1   1      1  0   0   0   0       0    0     0
## 1      1      1   1    1  1      1     1     1   1   1   1      0  1   1   1   1       1    1     1
## 12     1      1   1    1  1      1     1     1   1   1   1      0  1   0   0   1       1    1     1
## 13     1      1   1    1  1      1     1     1   1   1   1      0  1   0   0   1       1    1     1
## 3      1      1   1    1  1      1     1     1   1   1   1      0  1   0   0   0       0    0     0
## 1      1      1   1    1  1      1     1     1   1   1   1      0  0   0   0   1       1    1     1
## 2      1      1   1    1  1      1     1     1   1   1   1      0  0   0   0   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   1   1      0  0   0   0   0       0    0     0
## 1      1      1   1    1  1      1     1     1   1   1   1      0  0   0   0   0       0    0     0
## 1      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   1       1    1     1
## 2      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   1       1    1     1
## 7      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   0       0    0     0
## 1      1      1   1    1  1      1     1     1   1   0   0      1  0   1   1   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   0   0      1  0   0   0   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   0   0      0  1   0   0   1       1    1     1
## 2      1      1   1    1  1      1     1     1   1   0   0      0  1   0   0   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   0   0      0  0   0   0   1       1    1     1
## 1      1      1   1    1  1      1     1     1   1   0   0      0  0   0   0   0       0    0     0
## 2      1      1   1    1  1      1     1     1   0   1   0      1  1   1   1   1       1    1     1
## 1      1      1   1    1  1      1     1     1   0   1   0      1  1   1   1   1       1    1     1
## 2      1      1   1    1  1      1     1     1   0   1   0      1  1   1   1   1       1    1     1
## 3      1      1   1    1  1      1     1     1   0   1   0      1  1   1   1   1       1    1     1
## 1      1      1   1    1  1      1     1     1   0   1   0      1  0   1   1   0       0    0     0
## 1      1      1   1    1  1      1     1     1   0   1   0      0  1   0   0   1       1    1     1
## 1      1      1   1    1  1      1     1     1   0   1   0      0  0   0   0   1       1    1     1
## 3      1      1   1    1  1      1     1     1   0   0   0      1  1   1   1   1       1    1     1
## 2      1      1   1    1  1      1     1     1   0   0   0      1  1   1   1   1       1    1     1
## 2      1      1   1    1  1      1     1     0   1   1   1      1  1   1   1   1       1    1     1
## 10     1      1   1    1  1      1     1     0   1   1   1      1  1   1   1   1       1    1     1
## 1      1      1   1    1  1      1     1     0   1   1   1      1  0   1   1   0       0    0     0
## 1      1      1   1    1  1      1     1     0   1   1   1      0  1   0   0   1       1    1     1
## 1      1      1   1    1  1      1     0     1   1   1   1      1  1   1   1   0       0    0     0
##        0      0   0    0  0      0     1    14  16  24  35     42 60  61  61  88      88   88    91
##     uricacid albu bili alc HyperMed     
## 33         1    1    1   1        1    0
## 534        1    1    1   1        0    1
## 69         1    1    1   0        1    1
## 149        1    1    1   0        0    2
## 1          1    1    0   1        0    2
## 2          0    0    0   1        0    5
## 1          0    0    0   0        1    5
## 20         0    0    0   1        1    7
## 26         0    0    0   1        0    8
## 9          0    0    0   0        1    8
## 10         0    0    0   0        0    9
## 11         1    1    1   1        1    2
## 2          1    1    1   0        1    3
## 1          0    0    0   0        1   10
## 9          1    1    1   1        1    1
## 8          1    1    1   1        0    2
## 10         1    1    1   0        1    2
## 7          1    1    1   0        0    3
## 3          0    0    0   1        0    9
## 1          0    0    0   0        1    9
## 5          0    0    0   0        0   10
## 2          1    1    1   0        1    4
## 3          0    0    0   0        1   11
## 1          1    1    1   1        0    2
## 12         1    1    1   1        0    4
## 13         1    1    1   0        0    5
## 3          0    0    0   0        0   12
## 1          1    1    1   1        0    5
## 2          1    1    1   0        0    6
## 1          0    0    0   1        0   12
## 1          0    0    0   0        0   13
## 1          1    1    1   1        1    2
## 2          1    1    1   1        0    3
## 1          1    1    1   0        1    3
## 7          1    1    1   0        0    4
## 1          0    0    0   1        0   10
## 1          1    1    1   0        1    4
## 1          1    1    1   0        1    6
## 1          1    1    1   1        0    6
## 2          1    1    1   0        0    7
## 1          1    1    1   0        0    8
## 1          0    0    0   0        0   15
## 2          1    1    1   1        1    2
## 1          1    1    1   1        0    3
## 2          1    1    1   0        1    3
## 3          1    1    1   0        0    4
## 1          0    0    0   0        1   11
## 1          1    1    1   0        0    7
## 1          1    1    1   0        0    8
## 3          1    1    1   1        0    4
## 2          1    1    1   0        0    5
## 2          1    1    1   1        1    1
## 10         1    1    1   1        0    2
## 1          0    0    0   0        1   10
## 1          1    1    1   0        0    6
## 1          0    0    0   0        0   10
##           91   91   92 315      817 2075

md.pattern() from package mice gives us a matrix where each row represents one missing data pattern (1 = observed, 0 = missing). The rownames show how many rows of the dataset have the given pattern. The last column shows the number of missing values in each pattern, the last row the number of missing values per variable.

md.pattern() also plots the missing data pattern automatically.

Solution (ii)

##      educ cohort age race DM gender smoke occup wgt hgt BMI hypten WC SBP DBP HDL hypchol chol
## 1       1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1
## 2       1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1
## 3       1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1
## 4       1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1
## 5       1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   0       0    0
## 6       1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   0       0    0
## 7       1      1   1    1  1      1     1     1   1   1   1      0  1   0   0   1       1    1
## 8       1      1   1    1  1      1     1     1   1   1   1      0  1   0   0   1       1    1
## 9       1      1   1    1  1      1     1     1   1   1   1      1  1   0   0   1       1    1
## 10      1      1   1    1  1      1     1     0   1   1   1      1  1   1   1   1       1    1
## 11      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   1       1    1
## 12      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   0       0    0
## 13      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   0       0    0
## 14      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   1       1    1
## 15      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   1       1    1
## 16      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   1       1    1
## 17      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   1       1    1
## 18      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   0       0    0
## 19      1      1   1    1  1      1     1     1   1   1   1      1  0   0   0   0       0    0
## 20      1      1   1    1  1      1     1     1   0   0   0      1  1   1   1   1       1    1
## 21      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   0       0    0
## 22      1      1   1    1  1      1     1     1   1   1   1      0  1   0   0   0       0    0
## 23      1      1   1    1  1      1     1     1   0   1   0      1  1   1   1   1       1    1
## 24      1      1   1    1  1      1     1     1   0   1   0      1  1   1   1   1       1    1
## 25      1      1   1    1  1      1     1     1   1   0   0      0  1   0   0   1       1    1
## 26      1      1   1    1  1      1     1     1   1   1   1      0  0   0   0   1       1    1
## 27      1      1   1    1  1      1     1     1   0   0   0      1  1   1   1   1       1    1
## 28      1      1   1    1  1      1     1     1   1   1   1      1  0   0   0   1       1    1
## 29      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   1       1    1
## 30      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1
## 31      1      1   1    1  1      1     1     1   1   1   1      1  1   0   0   1       1    1
## 32      1      1   1    1  1      1     1     0   1   1   1      1  1   1   1   1       1    1
## 33      1      1   1    1  1      1     1     1   0   1   0      1  1   1   1   1       1    1
## 34      1      1   1    1  1      1     1     1   1   0   0      1  0   1   1   1       1    1
## 35      1      1   1    1  1      1     1     1   1   0   0      0  1   0   0   1       1    1
## 36      1      1   1    1  1      1     1     1   1   1   1      0  1   1   1   1       1    1
## 37      1      1   1    1  1      1     1     1   1   0   0      0  0   0   0   0       0    0
## 38      1      1   1    1  1      1     1     1   1   1   1      0  0   0   0   1       1    1
## 39      1      1   1    1  1      1     1     1   1   1   1      1  1   0   0   0       0    0
## 40      1      1   1    1  1      1     1     1   0   1   0      1  1   1   1   1       1    1
## 41      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1
## 42      1      1   1    1  1      1     1     1   1   1   1      0  0   0   0   0       0    0
## 43      1      1   1    1  1      1     1     0   1   1   1      1  0   1   1   0       0    0
## 44      1      1   1    1  1      1     1     1   0   1   0      1  0   1   1   0       0    0
## 45      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   1       1    1
## 46      1      1   1    1  1      1     1     1   0   1   0      0  1   0   0   1       1    1
## 47      1      1   1    1  1      1     0     1   1   1   1      1  1   1   1   0       0    0
## 48      1      1   1    1  1      1     1     1   1   1   1      1  0   1   1   0       0    0
## 49      1      1   1    1  1      1     1     1   1   1   1      1  1   1   1   1       1    1
## 50      1      1   1    1  1      1     1     1   0   1   0      0  0   0   0   1       1    1
## 51      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   0       0    0
## 52      1      1   1    1  1      1     1     1   1   0   0      1  0   0   0   1       1    1
## 53      1      1   1    1  1      1     1     1   1   1   1      0  0   0   0   0       0    0
## 54      1      1   1    1  1      1     1     1   1   0   0      1  1   1   1   1       1    1
## 55      1      1   1    1  1      1     1     1   1   0   0      0  0   0   0   1       1    1
## 56      1      1   1    1  1      1     1     0   1   1   1      0  1   0   0   1       1    1
## Nmis    0      0   0    0  0      0     1    14  16  24  35     42 60  61  61  88      88   88
##      creat uricacid albu bili alc HyperMed Npat
## 1        1        1    1    1   1        0  534
## 2        1        1    1    1   0        0  149
## 3        1        1    1    1   0        1   69
## 4        1        1    1    1   1        1   33
## 5        0        0    0    0   1        0   26
## 6        0        0    0    0   1        1   20
## 7        1        1    1    1   0        0   13
## 8        1        1    1    1   1        0   12
## 9        1        1    1    1   1        1   11
## 10       1        1    1    1   1        0   10
## 11       1        1    1    1   0        1   10
## 12       0        0    0    0   0        0   10
## 13       0        0    0    0   0        1    9
## 14       1        1    1    1   1        1    9
## 15       1        1    1    1   1        0    8
## 16       1        1    1    1   0        0    7
## 17       1        1    1    1   0        0    7
## 18       0        0    0    0   0        0    5
## 19       0        0    0    0   0        1    3
## 20       1        1    1    1   1        0    3
## 21       0        0    0    0   1        0    3
## 22       0        0    0    0   0        0    3
## 23       1        1    1    1   0        0    3
## 24       1        1    1    1   0        1    2
## 25       1        1    1    1   0        0    2
## 26       1        1    1    1   0        0    2
## 27       1        1    1    1   0        0    2
## 28       1        1    1    1   0        1    2
## 29       1        1    1    1   1        0    2
## 30       0        0    0    0   1        0    2
## 31       1        1    1    1   0        1    2
## 32       1        1    1    1   1        1    2
## 33       1        1    1    1   1        1    2
## 34       1        1    1    1   0        1    1
## 35       1        1    1    1   1        0    1
## 36       1        1    1    1   1        0    1
## 37       0        0    0    0   0        0    1
## 38       1        1    1    1   1        0    1
## 39       0        0    0    0   0        1    1
## 40       1        1    1    1   1        0    1
## 41       1        1    1    0   1        0    1
## 42       0        0    0    0   0        0    1
## 43       0        0    0    0   0        1    1
## 44       0        0    0    0   0        1    1
## 45       1        1    1    1   0        1    1
## 46       1        1    1    1   0        0    1
## 47       0        0    0    0   0        0    1
## 48       0        0    0    0   0        1    1
## 49       0        0    0    0   0        1    1
## 50       1        1    1    1   0        0    1
## 51       0        0    0    0   1        0    1
## 52       1        1    1    1   0        1    1
## 53       0        0    0    0   1        0    1
## 54       1        1    1    1   1        1    1
## 55       1        1    1    1   0        0    1
## 56       1        1    1    1   0        0    1
## Nmis    91       91   91   92 315      817 2075

The function md_pattern() from package JointAI gives a matrix very similar to the one obtained from mice::md.pattern(). Hovever, here, the number of rows in the data that have a particular missing data pattern are given in the last column.

For more information on how to customize the visualization by md_pattern() see the vignette Visualizing Incomplete Data on the webpage of JointAI.

Solution (iii)

aggr() from VIM plots a histogram of the proportion of missing values per column as well as a visualization of the missing data pattern. Here, a small histogram on the right edge of the plot shows the proportion of cases in each pattern.

For more options how to customize the plot, see the VIM documentation.

Proportion of missing values

Task

Now get an overview of

  • how much missingness there is in each variable, and
  • the proportion of (in)complete cases.

The function complete.cases() will give you a vector that is TRUE if the the case is complete and FALSE if there are missing values.

is.na() returns TRUE if the value is missing, FALSE if the value is observed

colSums() calculates the sum of values in each column of a data.frame or matrix

colMeans() calculates the mean of values in each column of a data.frame or matrix

Solution

##              #    %
## complete    33 96.7
## incomplete 967  3.3
##          # NA % NA
## educ        0  0.0
## cohort      0  0.0
## age         0  0.0
## race        0  0.0
## DM          0  0.0
## gender      0  0.0
## smoke       1  0.1
## occup      14  1.4
## wgt        16  1.6
## hgt        24  2.4
## BMI        35  3.5
## hypten     42  4.2
## WC         60  6.0
## SBP        61  6.1
## DBP        61  6.1
## HDL        88  8.8
## hypchol    88  8.8
## chol       88  8.8
## creat      91  9.1
## uricacid   91  9.1
## albu       91  9.1
## bili       92  9.2
## alc       315 31.5
## HyperMed  817 81.7

Relationship between variables

In the missing data pattern we could already see that some variables tend to be missing together. But there may be different types of relationships between variables that are of interest.

Task 1

Our data contains hgt (height), wgt (weight) and BMI. Check the missing data pattern specifically for those three variables.

Solution 1

##      wgt hgt BMI Npat
## 1      1   1   1  965
## 2      1   0   0   19
## 3      0   1   0   11
## 4      0   0   0    5
## Nmis  16  24  35   75
## , ,  = BMI mis.
## 
##              
##               weight mis. weight obs.
##   height mis.           5          19
##   height obs.          11           0
## 
## , ,  = BMI obs.
## 
##              
##               weight mis. weight obs.
##   height mis.           0           0
##   height obs.           0         965

As we have already seen in the lecture, there are some cases where only either hgt or wgt is missing. BMI is only observed when both components are observed. To use all available information, we want to impute hgt and wgt separately and calculate BMI from the imputed values.

Task 2

The data contains variables on hypertension (hypten) and the use of medication to treat hypertension (HyperMed). We might expect that medication is only prescribed for patients with hypertension, hence, we need to investigate the relationship between HyperMed and hypten.

  • Make a table of the two variables to confirm our expectation.
  • Make sure that missing values are also displayed in that table!

Solution 2

##           hypten
## HyperMed    no yes <NA>
##   no         0  25    0
##   previous   0  14    0
##   yes        0 144    0
##   <NA>     717  58   42

Task 3

Furthermore, we can expect a systematic relationship between the variables hypchol (Hypercholesterolemia) and chol (serum cholesterol). Find out how these two variables are related.

Solution 3

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   2.070   4.140   4.760   4.715   5.350   6.180      88
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   6.210   6.410   6.650   6.896   7.140   9.930      88

It seems that hypchol is defined as chol > 6.2, which makes hypchol dependent on chol.

Task 4

Visualize the correlations between variables to detect associations that you may have missed.

Solution 4

Note: Correlations involving categorical variables are not usually done! We only use this as a quick-and-dirty method for visualization of relationships.

The questionmaks in the plot indicate that no correlation could be calculated for cohort (because cohort is the same for all subjects) and neither between HyperMed and hypten (but we already knew that).

The plot shows that besides the relations we were already aware of, wgt is strongly correlated with WC. Comparing the number of cases where only either wgt or WC is missing shows that there are 14 cases where wgt is missing but WC is observed and 58 cases where WC is missing and wgt is observed.

##          
##           wgt mis. wgt obs.
##   WC mis.        2       58
##   WC obs.       14      926

Distribution of the data

Before imputing missing values it is important to take a look at how the observed parts of incomplete variables are distributed, so that we can choose appropriate imputation models.

Task

Visualize the distributions of the incomplete continuous and categorical variables. The package JointAI provides the convenient function plot_all(),

Pay attention to

  • whether continuous distributions deviate considerably from the normal distribution,
  • if variables have values close to the limits of the range they are defined in,
  • whether categorical variables are very unbalanced (i.e., some category very small).

Solution

To learn more about additional options of plot_all() check out the vignette Visualizing Incomplete Data on the webpage of JointAI.

 

© Nicole Erler